Executive Summary

The purpose of this vignette is to explore the relationship between risk and other attributes of the World Bank’s projects.

Key takeaways

  • Risk is somewhat related to the region in which a project is being implemented. However, there are other, stronger predictors of the overall risk rating, such as the tenure of the project’s TTL (experienced TTLs are more likely to be trusted with high-risk projects), the scale of the project (as represented by the amounts committed), and the year the project was approved (a proxy for world events, changes in the Bank’s risk tolerance, etc.)

  • The strongest relationship was found between the East Asia and Pacific region and macroeconomic and political risks. In this region such risks are expected to be lower, as the correlations are negative at -0.23 and -0.27. The visualizations can be found in the analysis below.

  • Overall risk rating is not a linear function of the other specific types of risk. Institutional capacity is the type of risk that contributes the most to the overall risk rating according to two distinct measures.

  • The fragility and conflict indicator is negatively related to the net value of World Bank loans. It is hard to provide loans when it is uncertain whether the borrowers will still be in place once the situation is resolved.

Preparation of the dataset

The following packages are used:

library(exploratory)
library(janitor)
library(lubridate)
library(hms)
library(tidyr)
library(stringr)
library(readr)
library(forcats)
library(RcppRoll)
library(dplyr)
library(tibble)
library(rio)
library(plotly)
library(reshape2)
library(alluvial)
library(caret)

Tidy data

The provided datasets — project_data and risk_data — come in wide and long formats, respectively.

We bring these datasets together in a tidy format, where each column is a variable, and each row is a unique observation. In this case, the unique identifier is project_id.

# Converting `risk_rating` from letters to a numeric scale to facilitate
# computation in R: Low (1), Moderate (2), Substantial (3), or High (4)
risk_data <- risk_data %>% 
  mutate(risk_rating = case_when(
    risk_rating == "L" ~ 1,
    risk_rating == "M" ~ 2,
    risk_rating == "S" ~ 3,
    risk_rating == "H" ~ 4,
    TRUE ~ 0))

# Reshaping the data and cleaning up
risk_data_wide <- dcast(risk_data, 
                        project_id + risk_rating_sequence ~ risk_rating_code,
                        value.var="risk_rating",
                        fun.aggregate=mean)
# Joining the data and getting rid of variables that have no variation
joined_data <- project_data %>% 
  inner_join(risk_data_wide, by = c("project_id" = "project_id")) %>%
  select(-scale_up, -len_instr_type)
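The same reshape can be done without reshape2, using the already-loaded tidyr package. A minimal sketch, run on a hypothetical miniature of risk_data (the real column values are assumptions for illustration):

```r
library(tidyr)
# Hypothetical miniature of risk_data, for illustration only
risk_data <- data.frame(
  project_id           = c("P0001", "P0001", "P0002"),
  risk_rating_sequence = c(1, 1, 1),
  risk_rating_code     = c("POL", "MAC", "POL"),
  risk_rating          = c(3, 2, 4))
# Same reshape as the dcast() call above, using tidyr instead of reshape2
risk_data_wide <- pivot_wider(risk_data,
                              id_cols     = c(project_id, risk_rating_sequence),
                              names_from  = risk_rating_code,
                              values_from = risk_rating,
                              values_fn   = mean)
```

As with dcast, duplicated project/sequence/code combinations are aggregated with mean, and missing combinations become NA.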

Part 1. Risk and Regions

Interactive Map for Exploratory Data Analysis

Distribution of Risk


Disclaimer: The analysis below was performed on the entire universe of the data. Risks such as political, governance, and macroeconomic risk may change quickly once a new administration is in place. Therefore, the produced insights should be treated as generalizations from historical data.


Upon completing exploratory data analysis, we conclude that an alluvial plot is the best way to visualize the many relationships between overall risk and regions.

Assumption 1: All risk evaluations are performed subjectively by staff. The overall risk category is a qualitative assessment of risk, which may or may not be a direct function of the other types of risk.

Assumption 2: Overall risk is the key factor when making a decision on a project. We will focus on it first, proceeding to study the subcategories of risk later.

# Prepare data for visualization 
joined_data_freq <- joined_data %>% 
  group_by(risk_overall, region, fcs_indicator, proj_emrg_recvry_flg) %>% 
  summarise(freq = n(), .groups = "drop") %>% 
  filter(region != "OTH") %>% 
  arrange(desc(region))

# Create an alluvial chart
alluvial(joined_data_freq[,1:4], 
         freq=joined_data_freq$freq, 
         border=NA,
         hide = joined_data_freq$freq < quantile(joined_data_freq$freq, 0.5),
         col=ifelse(joined_data_freq$risk_overall == "4", "red", 
             ifelse(joined_data_freq$risk_overall == "3", "orange", 
             ifelse(joined_data_freq$risk_overall == "2", "cyan", "blue"))))

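A legend can be added to the base-graphics alluvial chart with `graphics::legend`. A minimal sketch, assuming the same color mapping as the `col = ifelse(...)` argument above (the placement and labels are suggestions, not part of the original code):

```r
# Color mapping assumed to match the alluvial chart above
risk_colors <- c("Low (1)"         = "blue",
                 "Moderate (2)"    = "cyan",
                 "Substantial (3)" = "orange",
                 "High (4)"        = "red")
# Only draw the legend when a plotting device is already open
if (dev.cur() > 1) {
  legend("topleft", legend = names(risk_colors), fill = risk_colors,
         bty = "n", title = "Overall risk")
}
```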

The above figure shows in color how different levels of risk are distributed across regions, including those facing fragile, conflict, or emergency situations. These situations have been chosen to accompany regions because they are intrinsically related to specific geographies.

While it presents a lot of information in a compact format, staff familiar with the dataset will be able to read it easily. For instance, it can be seen that:

  • A little over 50% of projects have a Substantial (3) risk rating. The second most frequent rating is Moderate. This shows most risk assessors refrain from making extreme judgements.
  • Therefore, it is not surprising that Substantial (3) and High (4) risk projects are spread across the regions in a pattern close to a normal distribution. A percentage breakdown of overall risk by region can be found below.
  • Africa is the region where the most projects are implemented. It can be seen that over two-thirds of the projects there carry a Substantial (3) or High (4) risk rating.

Risk by Region

The takeaway of this table is that proportions of projects of each risk category are approximately the same across all regions. This is a piece of evidence showing that risk is not substantially or exclusively related to the region in which a project is being implemented.
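A table like this can be produced with the already-loaded janitor package. A sketch on a hypothetical miniature of joined_data (the values below are illustrative, not the actual dataset):

```r
library(janitor)
library(dplyr)
# Hypothetical miniature of joined_data, for illustration only
joined_data <- data.frame(
  region       = c("AFR", "AFR", "EAP", "EAP", "SAR", "SAR"),
  risk_overall = c(3, 4, 3, 2, 3, 3))
# Row-percentage breakdown of overall risk by region
risk_by_region <- joined_data %>%
  tabyl(region, risk_overall) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 0)
risk_by_region
```

Each row then sums to 100%, making it easy to compare risk profiles across regions of very different sizes.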

Correlates of Risk

To better understand the relationship between risk and region, we will look for more patterns in the data.

Let’s create a correlation matrix that will help us understand which variables are similar based on how the underlying data varies.

# Performing one-hot encoding of the region and other categorical variables.
# Each value of a categorical feature becomes its own binary {0, 1} feature.
j_data_corr <- as.data.frame(joined_data)

j_data_corr$region <- as.factor(j_data_corr$region)
for(level in unique(j_data_corr$region)){
  j_data_corr[paste("reg", level, sep = "_")] <- ifelse(j_data_corr$region == level, 1, 0)
}

j_data_corr$fcs_indicator <- ifelse(j_data_corr$fcs_indicator == "Y", 1, 0)
j_data_corr$proj_emrg_recvry_flg <- ifelse(j_data_corr$proj_emrg_recvry_flg == "Y", 1, 0)

# Dropping the team leader column, which has no numeric interpretation
j_data_corr <- j_data_corr %>% select(-tl)

# Running Spearman correlation analysis
var_corrs <- j_data_corr %>% 
  do_cor(which(sapply(., is.numeric)), 
         use = "pairwise.complete.obs", 
         method = "spearman", 
         distinct = FALSE, 
         diag = TRUE)
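`do_cor` comes from the `exploratory` package, which may not be available in every environment. An equivalent computation with base R's `cor`, shown on a hypothetical miniature of the data:

```r
# Hypothetical miniature of j_data_corr, for illustration only
j_data_corr <- data.frame(
  risk_overall  = c(1, 2, 3, 4, 3),
  fcs_indicator = c(0, 0, 1, 1, 1),
  region        = c("AFR", "EAP", "AFR", "SAR", "EAP"))  # non-numeric, dropped
# Spearman correlations across all numeric columns,
# using pairwise-complete observations as in do_cor() above
num_cols <- j_data_corr[sapply(j_data_corr, is.numeric)]
corr_mat <- cor(num_cols, use = "pairwise.complete.obs", method = "spearman")
corr_mat
```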

There are very few strong positive correlations in the above plot. The strongest of them involve the indicator of a project being in a fragile or conflict situation.

  • The 0.25 correlation between fcs_indicator and risk_overall shows that fragile locations are somewhat associated with higher risk. While the relationship is positive, it is weak.

  • The 0.25 correlation between the Africa region and Political and Governance risk hints at instability in the region, especially relative to the -0.28 political risk correlation in East Asia Pacific. A similar pattern is observed for macroeconomic risk in these regions. These are the strongest relationships between risk and region.

  • Other regions do not show as strong a relationship with the different types of risk.

We also observe a few strongly negative correlations:

  • The -0.29 correlation between fcs_indicator and net_commit_amt, the net value of the World Bank loans. It is hard to provide loans when it is uncertain whether the borrowers will still be in place once the situation is resolved.

  • The -0.95 correlation between approval_fy and risk_rating_sequence is trivial: more recently approved projects have had fewer opportunities to be assessed later in their lifecycle.

  • The -0.62 correlation between net_commit_amt and grant is straightforward as well: the Bank tends to provide one type of aid over the other, either a loan or a grant. However, some exceptions apply.

Having these preliminary results, we proceed to a more advanced analysis.

Modelling with Extreme Gradient Boosted Trees
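The chunk that fits the model is not shown in this excerpt. A minimal sketch of such a caret pipeline, run on synthetic data with assumed column names (not the authors' actual code), might look like:

```r
library(caret)
set.seed(42)
# Hypothetical synthetic stand-in for joined_data; the real feature set is richer
toy <- data.frame(
  risk_overall   = factor(sample(1:4, 200, replace = TRUE)),
  net_commit_amt = rnorm(200, mean = 100, sd = 30),
  approval_fy    = sample(2005:2020, 200, replace = TRUE))
# 70/30 train/test split, stratified on the outcome
idx       <- createDataPartition(toy$risk_overall, p = 0.7, list = FALSE)
train_set <- toy[idx, ]
test_set  <- toy[-idx, ]
# A single hyperparameter combination, so no resampling is needed
grid <- expand.grid(nrounds = 50, max_depth = 3, eta = 0.3, gamma = 0,
                    colsample_bytree = 1, min_child_weight = 1, subsample = 1)
m_xgb <- train(risk_overall ~ ., data = train_set, method = "xgbTree",
               trControl = trainControl(method = "none"), tuneGrid = grid)
cm <- confusionMatrix(predict(m_xgb, test_set), test_set$risk_overall)
cm
```

The confusion matrix below summarizes this kind of held-out evaluation on the actual data.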

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1   0   4   0   0
##          2   5  52  47   5
##          3   6  71 140  18
##          4   0   1  12  17
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5529          
##                  95% CI : (0.5012, 0.6038)
##     No Information Rate : 0.5265          
##     P-Value [Acc > NIR] : 0.1639          
##                                           
##                   Kappa : 0.2106          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity           0.00000   0.4062   0.7035  0.42500
## Specificity           0.98910   0.7720   0.4693  0.96154
## Pos Pred Value        0.00000   0.4771   0.5957  0.56667
## Neg Pred Value        0.97059   0.7175   0.5874  0.93391
## Prevalence            0.02910   0.3386   0.5265  0.10582
## Detection Rate        0.00000   0.1376   0.3704  0.04497
## Detection Prevalence  0.01058   0.2884   0.6217  0.07937
## Balanced Accuracy     0.49455   0.5891   0.5864  0.69327

We have built a model of overall risk, and the above results pertain to the 30% of the data that was held out for testing.

The results show that XGBoost achieves a balanced accuracy of 0.69 when predicting High (4) risk. For risk levels 2 and 3 the balanced accuracy is about 0.59, better than random assignment. For risk level 1 it is around 0.49. This is a good result, given that we are most concerned with the higher levels of risk: in this scenario it is more valuable to predict high risk accurately, even at the cost of some false alarms, than to miss a high-risk project entirely. Moreover, there are only 11 cases of risk level 1 in the test set, which makes that class hard for the algorithm to learn.

Variable Importance

options("scipen" = 100, "digits" = 4)
# Keeping only features above an importance threshold to declutter the plot
p <- m_xgb_coef %>% filter(importance > 0.00683778) %>% 
  ggplot(aes(x = reorder(feature, importance), 
             y = importance,
             fill = importance)) +
    coord_flip() +
    scale_fill_gradient(low = "gray", high = "red", "Variable\nimportance") +
    geom_col() +
    theme_bw() +
    xlab("Variable names") +
    ylab("")
ggplotly(p, tooltip=c("y"))

According to the plot above, the most important features for predicting risk in this dataset fall into two clusters:

High importance:

  • tl_since: Date the current project manager came on to the project
  • net_commit_amt: Value of the World Bank loan(s) associated with the project in millions of USD.

Low importance:

  • approval_fy: Fiscal Year when the project was approved
  • grant: Value of any World Bank grants associated with the project in millions of USD.
  • fcs_indicator: Indicates if the project is in a Fragile or Conflict Situation.
  • certain regions, countries, practices, etc.

To answer our question, some regions are better predictors of risk than others. The modelling exercise confirms the earlier observation that East Asia Pacific (EAP) is a stronger predictor than some other regions. A new piece of information, however, is that ECA is also notable. Still, all regions fall into the low importance cluster.

Part 2. Types of Risk

In Part 1 we focused on the overall risk measure and its relation to region. We are also tasked with exploring the association between risk and other measures; however, this has already been covered by the visualizations and analyses in Part 1. Those results are valuable, yet there is an even more insightful discovery below.

For this section, we have prepared an analysis of how different types of risk influence the measure of overall risk. Thus, we use only the types of risk to predict the overall risk rating. This cannot be used in applied settings, because it constitutes data leakage. In this case the leakage is intentional: it shows that the contribution of different types of risk is not even.
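The predictor selection for this leakage experiment can be sketched as follows. The "risk_" column-name prefix and the column values are assumptions for illustration, not the dataset's actual schema:

```r
# Hypothetical miniature of joined_data, for illustration only
joined_data <- data.frame(
  risk_overall   = c(3, 4, 2),
  risk_inst_cap  = c(3, 4, 2),
  risk_political = c(2, 4, 1),
  net_commit_amt = c(120, 80, 45))
# Keep only the risk-type columns as predictors (intentional leakage);
# everything else, such as financial variables, is dropped
risk_cols <- setdiff(grep("^risk_", names(joined_data), value = TRUE),
                     "risk_overall")
risk_only <- joined_data[, c("risk_overall", risk_cols)]
```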

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1   2   6   0   0
##          2   9  86  16   0
##          3   0  36 172  17
##          4   0   0  11  23
## 
## Overall Statistics
##                                              
##                Accuracy : 0.749              
##                  95% CI : (0.702, 0.792)     
##     No Information Rate : 0.526              
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.564              
##  Mcnemar's Test P-Value : NA                 
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity           0.18182    0.672    0.864   0.5750
## Specificity           0.98365    0.900    0.704   0.9675
## Pos Pred Value        0.25000    0.775    0.764   0.6765
## Neg Pred Value        0.97568    0.843    0.824   0.9506
## Prevalence            0.02910    0.339    0.526   0.1058
## Detection Rate        0.00529    0.228    0.455   0.0608
## Detection Prevalence  0.02116    0.294    0.595   0.0899
## Balanced Accuracy     0.58273    0.786    0.784   0.7712

Here we observe that the #1 contributor to overall risk is institutional capacity.

We run the correlation analysis again to magnify the relationships between types of risk.